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THE PSEUDO-MARGINAL APPROACH FOR EFFICIENT MONTE 

CARLO COMPUTATIONS 
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University of Bristol and University of Warwick 

We introduce a powerful and flexible MCMC algorithm for stochas- 
tic simulation. The method builds on a pseudo-marginal method orig- 
inally introduced in [Genetics 164 (2003) 1139-1160], showing how 
algorithms which are approximations to an idealized marginal algo- 
rithm, can share the same marginal stationary distribution as the 
idealized method. Theoretical results are given describing the con- 
vergence properties of the proposed method, and simple numerical 
examples are given to illustrate the promising empirical characteris- 
tics of the technique. Interesting comparisons with a more obvious, 
but inexact, Monte Carlo approximation to the marginal algorithm, 
are also given. 

1. Introduction. We are interested in the problem of simulation from a 
probability distribution ir(d9,dz) which, for now, we shall assume admits a 
density ir(9, z) with respect to some a-finite measure (which we shall just 
write as d6 x dz). The variables 9 and z are elements of essentially arbi- 
trary spaces, and Z, respectively. We partition the state space in this way 
because, either: 

1. interest lies mainly in the marginal law ir(d9) of the variable 9 6 [which 
we shall assume, for now, admits density tt(9) with respect to d9]; or 

2. exploration of tt(9) by MCMC methods is more convenient by appropriate 
auxiliary simulation. 

In a Bayesian framework, for example, 9 could represent a parameter 
of interest and z a set of missing data or latent variables; this includes, 
among others, hidden Markov models and their continuous generalizations, 
but also mixture models, and as we shall see in the application section, 
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model selection problems in general [8]. Often the variable z is introduced 
for convenience, in particular in cases where the marginal density tt(9) is 
of sole interest. Indeed n(9), or expectations with respect to it, might be 
analytically intractable or too complex to evaluate, whereas the introduction 
of z might lead to an analytical expression, or ease the implementation of 
numerical methods. 

A relatively generic way of numerically approximating such expectations 
consists of simulating an ergodic Markov chain Zi)} which admits ir(9, z) 
as invariant probability density: such techniques are known under the acronym 
MCMC (Markov chain Monte Carlo). A typical sampling scheme will alter- 
nate sampling from the conditionals w(9\z) and n(z\9), or more generally 
ergodic Markov transition probabilities with these conditionals as invariant 
distributions. Although such so-called data augmentation schemes can very 
often ease programming and lead to elegant algorithms, it is well established 
that in numerous situations they can result in strongly positively correlated 
samples {(9i,Zi)} (see, e.g., [5, 9]), which is an undesirable property when 
efficiency is sought. On the other hand, if tt(9) was known analytically or 
cheap to compute, it would often be possible to generate "more efficient" 
samples {9{\ from a Markov chain with a transition probability P, typically 
a Metropolis-Hastings (MH) transition with invariant density tt(9) and pro- 
posal density q(9,{>). This celebrated MCMC update consists, given that 
the Markov chain is currently at 9, of proposing 9* ~ q(9, •) and set the next 
value of the chain •d = 9* with probability a(9,9*) which depends on the 
values of the densities 7r(9) and ir(9*); otherwise we set i? = 9. More details 
are given below in the form of pseudo-code (note that, in our context, we 
shall term such an algorithm a marginal algorithm). 

These general remarks have lead to the development of MCMC algorithms 
that try to combine the benefits of both approaches: possible statistical and 
computational efficiency of sampling directly from tt(9), and implementa- 
tional ease of augmented, or auxiliary, schemes. A natural approach consists 
of approximating the intractable density values n(9) and 7r(i9) required for 
the computation of the acceptance probability of the MH update with im- 
portance sampling estimates [8], that is for some integer N > 1 and some 
importance probability density q$(z) (satisfying the usual support assump- 
tion) one can consider the estimators 
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Table 1 

Comparison of the marginal, MCWM and GIMH algorithms 



Step 


Marginal 




MCWM 


GIMH 


0. Given: 

1. Sample: 


9 and tt(0) 


6 

6 


and tt(6») 
fz-?^-), 

i z*~«£(.) 


9,Z and ^(0) 
0* ~g(0,-) 
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2. Compute: 

3. Compute: r = 
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7r(e*)9(e*,e) 
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« N (e) 9 (e,e*) 
tf = 0* 
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and simply plug these estimates in the expression for the marginal accep- 
tance ratio (2.6). Note that here and hereafter we frequently omit the de- 
pendency on Z := (*(1),;<2), . . . ,*(iV)) and 3 := (3(1), 3(2), . . . ,i(N)) for 
notational simplicity. We will denote qf (Z) and q$(3) the densities of Z 
and 3- 

There are, however, several possible implementations of this idea, and we 
now review two of them. Before embarking on a more formal presentation 
in Section 2, it can be helpful to give a comparative pseudo-code description 
of MCWM and GIMH (the acronyms are explained later in the text) and 
the marginal algorithm (see Table 1). 

The first approach considered here, which corresponds to the middle col- 
umn, is to attempt to approximate P, independently at each iteration using 
the importance ratio averages given in equation (1.1). More precisely both 
Z and Z* are "refreshed" at each iteration independently of previously sam- 
pled auxiliary variables given 6 and 6*. This algorithm is referred to as the 
Monte Carlo within Metropolis (MCWM) in [1] following the terminology 
of [7]. Due to the fact that the Z J s are independent at each iteration, one 
can easily see that {6i} is still a Markov chain with transition probability 
denoted P^CWM herea f ter . 

However MCWM and the marginal algorithm P are not equivalent. In par- 
ticular, 7r(0) is typically not the invariant distribution density of p^ CWM an d 
therefore will not produce samples from tv(9) even in steady state. However, 
intuitively, provided that p^ CWM i s ergodic, the samples generated by this 
procedure will asymptotically be distributed according to an approximation 
of 7r(0), which should be all the more precise that N is large. Furthermore, 
we would like to know whether for sufficiently large N, the transition prob- 
ability pMCWM does indeed inherit the convergence properties of P, as this 
was our initial motivation. 
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An interesting variation of MCWM could consist of using a single Z, 
sampled from some probability density q^ e ,(Z), to compute both tt n (9) and 
tt n (9*). We do not pursue this here, but rather focus on the following. 

In [1], Beaumont proposes a very interesting variation on the idea above, 
called grouped independence MH (GIMH), which corresponds to the right- 
most column above. The MH transition probability of GIMH, as presented 
by Beaumont is similar in spirit to MCWM, but differs in that no fresh Z is 
sampled at every iteration. Rather, GIMH can be interpreted as a form of 
MCWM where Z is "recycled" from the previous iteration; as a result Z is 
in general not distributed according to q^ , as is the case when the MCWM 
algorithm is used. Note that in addition {9i} is not a Markov chain any- 
more, but that {9i,Zi} defines a Markov chain; we hereafter denote P^ IMH 
the transition probability of this Markov chain. 

The remarkable property noticed by Beaumont is that the acceptance 
ratio of GIMH can be rewritten as 

7T N (9*)q{9*,9) 
TT N (9)q(9,9*) 

Eti z*(k)) n£i ;t/fc qe* (z*(l)))q(9*,9)q${Z) 
[1/N Eli tt(0, z{k)) n£ Wfc qe(z(l))]q(0, 0*)<# &*) ' 

which suggests that Pjy IMH is, to complete Beaumont's argument, a MH 
algorithm with proposal density q(9,'&)q^ (3) and target density given be- 
tween the brackets above, denoted tt n (9,Z) hereafter. Hence, as soon as 
GIMH defines an irreducible and aperiodic Markov chain it will produce 
samples {9%} distributed in the limit as i — > oo according to the marginal 
tt(9). Given this interpretation, it is important to point out that updates of 
the type P^ IMH are not the only available to sample from tt n (9, Z), a point 
apparently missed in the literature. For example, it is possible to update U Z 
given 9," which is crucial to address some of the weaknesses of this approach, 
see Section 5. This distinction between target distribution and update also 
motivates the pseudo-marginal terminology adopted here. 

The dual interpretation of GIMH as an approximation of a MH with tar- 
get density ir(9) or an MH with target distribution density tt n (6,Z) there- 
fore opens the possibility for the design of algorithms that inherit the poten- 
tial efficiency of P while still being able to produce samples from tt(9), and 
not an approximation. However one can reiterate the questions asked earlier 
about the convergence properties of pMCWM and their relation to the ideal 
transition probability P. 

Before giving some answers to these questions, we first show how the 
approach can be easily generalized in order to allow for more sophisticated 
transitions, leading to potential "local adaptation" schemes for example and 
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also suggest new applications, such as model selection and applications to 
reversible jump MCMC algorithms [3]. Sections 3-5 are dedicated to the 
theoretical properties of generalizations of GIMH. Our main results are: 
Theorem 1, where we show that if the marginal chain is irreducible and ape- 
riodic then generalizations of GIMH also converge; Theorem 6 shows that 
under very mild and intuitive conditions, mainly (A2) and (A3), general- 
izations of GIMH have finite horizon convergence properties very similar to 
those of the marginal algorithm, provided that N is large enough; in Theo- 
rem 8, under more stringent assumption, we investigate the geometric and 
uniform convergence of generalizations of GIMH. Section 6 is dedicated to 
some theoretical properties of generalizations of MCWM which turn out to 
be much simpler to establish than for generalizations of GIMH. In particular 
we show in Theorem 9 that if the marginal algorithm is uniformly ergodic, 
then generalizations of GIMH can inherit this property with arbitrary pre- 
cision. We conclude with Section 7 where we show how the ideas developed 
in this paper can be used in order to design efficient reversible jump MCMC 
algorithms to perform model selection using very simple mechanisms. 

2. Set up and notation. Hereafter we will need the following notation. 
For some integer N > 1 let Z := (z(l), z(2), . . . , z(N)) G Z N denote a generic 
vector of Z N with coordinates z(k), k = l,...,N. For any Z G Z N and k = 
1, . . . , N we will denote Z~ k := (z{l), z{k- 1), z(k+l), z(N)) G Z N ~ l 
with obvious conventions. For any Z G Z N and Z = (z(ki), . . . ,z(ki)) (for 
I G {1, ... ,N —1} and k\, . . . , fcj G {1, . . . , N}) a subvector of Z, we define Z \ 
{Z} : = (z(l), . . . , z{h - l),z{h + 1), . . . , z{k 2 - l),z{k 2 + 1), . . .) and Z l : = 
(z(l), . . . , z(l)) G Z l , with the notational convention Z° = 0. 

2.1. The pseudo-marginal. Let (Q,B(&)) and (Z,B(Z)) be two measur- 
able spaces. Let n(d9,dz) be a probability distribution on the space (0 x 
Z,£>(0)x£>(Z)), let 7r(d0) be its marginal distribution and let us denote 
for any 6 G 0, ivg(dz) the associated conditional (on 9) distribution. Let 
{Q$ {dZ),9 G and N G N} be a family of probability distributions, the 
"proposals," defined on (Z N ,B{Z N )), {(wf , w£ , . . . G [0, 1]*, JV6N: 
J2% =1 w% = 1} be a family of weights and let {Zg,k = 1, . . . ,N and N G N} 
be a family of subvectors of arbitrary sizes of vectors of the type Z~ k as 
defined above. Before defining the pseudo-marginal and its associated joint 
model, we require the following assumption. Denoting for any A G B(Z), 
Q$ (zk G A\Z^) the conditional distribution or z/. given Z^ , 

(Al) We assume that for our choice of {Q$}, {w^} and {Zj?}, for all 
N > 1, any 9 G and k = 1, . . . ,N, ir e (-) < Q?(-\Zj? ). 
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(Al) allows one to define for any N £ N the following linear combination of 
Radon-Nikodym derivatives for (9, Z) 6 x Z N (the importance weights) 

(2 - 1) 7 (9) - t ?r fc o„ N (<i 3 wi^)' 

the dependence on Z being implicit. Whenever J (9) > 0, we define X N (9) := 
| log J N (9)\ and by convention we let X N (9) := +oo when ^ N (9) = 0. In turn 
we can define the following probability distribution on (0 x Z, £>(0) x£?(Z)), 

(2.2) n N (d9,dZ) :=7r{d9)Q^{dZ) 1 N (9), 

which, as we shall see, is a generalization of the underlying target distribution 
identified in equation (1.2). Hereafter for any 9 £ G we will denote 

(2.3) n£(dZ):=Q»(dZ) 7 N (9), 

the conditional probability distribution of Z given 9 in equation (2.2) and 
we introduce for any Z £ Z, 

Tf^dfl) :=ir(d9)>y N (6) 

the "pseudo-marginal." Note that whenever ir(d9) has a density vr(^) and 
provided that Z\9 ~ Qq , then the associated probability density tt n (9) := 
ir(9)~f N (9) is an unbiased importance sampling estimator of vr(0) a funda- 
mental property which ensures that ir(d9) is the marginal of TT N (d9,dZ). 
It can indeed be easily checked that for any 9 £ 0, (7 7V (^)) = 1- Note 
that in practice the variability of 7r(9)'y N (9) [and hence (dZ)] for any 
9 E is expected to have an important influence on the performance of the 
algorithm an illustration is given in Theorem 8. We will frequently use the 
following identities: 

(2.4) K N (d9,dZ) = n N (d9)Q$ {dZ) =ir(d9)n£{dZ). 

We conclude this section with various examples of choices of {w^} and 
{Z^} introduced in the general framework presented earlier. 

Example 1 (Classical importance sampling). The case where = 1/N 
and (dZ) is factorizable and exchangeable, that is, 

N 

Q$(dZ) = UQg(dz(i)), 

i=l 

leads to Beaumont's GIMH algorithm. 
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Example 2 (Sequential sampling). A choice of great practical interest, 
as illustrated later on in Section 7, consists of the case where Zf = Z 1 ' 1 , 
which allows for sequential sampling of {z(i)}, hence offering the possibility 
to adapt the sampling strategy in light of already sampled z(i)'s. This se- 
quential framework encompasses the case where {z(i)} is a realization from 
a Gibbs sampler with target distribution irg(dz) for some 9 € 0. In such 
situations nondecreasing sequences {w^?} might be preferable in order to 
discount the "burn-in" period. 

Example 3 (Gibbs sampler type). In some situations we might have 
good reasons to believe that the analytically intractable marginal distribu- 
tions of Q$ (dZ) for 9 G are good approximations of irg (dz) . In this case 
one can suggest the application of the algorithm with Zf = Z~ % , which can 
be interpreted as a random scan Gibbs sampler to sample from (dZ), 
and hence its marginals. 

2.2. Pseudo-marginal based algorithms. We now introduce a formal de- 
scription of the transition probabilities of the marginal algorithm and the 
two variants of the pseudo-marginal approach, which can be seen as gener- 
alization of MCWM and GIMH. The transition probability of the marginal 
algorithm, a standard MH algorithm, targets ir(d9) and uses Q(9,di)) as 
proposal distribution is defined for any 9,$ € as 



1- / 0L{6,ti)Q(6,d&) 
Je 



(2.5) P(9, d$) := a(9, <&) Q(9, d&) + 5 e {d&) 
where a(9, €) := 1 A r(9, $) with (a) 

on a symmetric set R C x (see [11], Proposition 1 and Theorem 2), 
(b) r(9,{>) := 0, ir(d9)Q(9, (M)-ahnost everywhere on the complement R c of 
R and (c) r(#,$) := 1 on measurable subsets of R c of Tr(d9)Q(9,d'&)-zero 
probability. 

The transition probability of the generalization of MCWM is not a stan- 
dard MH algorithm. It consists of proposing i? ~ Q(0, •), Z ~ Qff and 3 ~ 
Q$ , compute an acceptance probability a N (9,d) defined below, and accept 
or reject the proposal according to a N {9,'d). More formally the transition 
probability is defined for any 9, t? £ as 

P™ isy (0, d-d) := Q% ® Q# (a N (9, $))Q(9, d&) 

(2 - 7) r t 

+ 5 e {d$) 1- / Q$ ®Q^{a N {9^))Q{9,dd) 
Je 
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where <8> indicates the product for measures and a N (0,"0) := 1 A f N {9,d), 
with (a) 

(2Z\ , 

(2 - 8) r V' 6) -*N(dB)Q(0,d*y 

on an appropriate set i?c(9x Z N ) 2 , (b) r N (e,tf) := 0, n N (d9)Q(9, M)- 
almost everywhere on the complement of R and (c) 1 otherwise. Note that 
r N (9,$) can be computed even in situations where the normalizing constant 
of ir N (dO,dZ) is unknown but that, on the other hand, the normalizing 
constant of (dZ) might be required. 

The transition probability of the GIMH variant of the pseudo-marginal 
approach is of the MH type and is defined on the extended space G x Z N . 
It targets ir N (d9,dZ) and uses the proposal distribution Q(9,d'd)Q^ (e£3) 

P^ xact (0,Z ;f M,d3) 

(2.9) =a N {e,V)Q(0,d$)Q%{d3) 

l- / a N (e,$)Q(e,d$)Q${d3)}, 

JBxZ N 

with a N (9,'d) as above equation (2.8). This expression for the acceptance 
probability of the exact pseudo-marginal algorithm relies on an identity, 
which we will frequently use later on, between the marginal acceptance ratio 
(2.6) and its exact pseudo-marginal counterpart, 

(2 10) r N (9 tfl - ^WWQW'MKtfm _ t"(J) « 

(2.10) (ft,*) .- ^ {M ,dZ)Q{0,d<>)Q${d3) ~ 7"(*) n ' J 

for any (0,Z,0,3) G R ■= {(O,Z,0,3) : (9,$) G R, Z G Z e , 3 eZ#} with 
Zq := {Z G Z : 7^(0) > 0}. For any N G N and (0, Z) G 6 x Z N we will 
denote a(0, Z) [resp. p(0, Z)] 

(2.11) a(0,Z):=l-p(0,Z):= [ a N (9, tf)Q(0, d#)Q$(<Q), 

J0xZ N 

the probability of leaving (resp. staying in) state (9,Z). Note that we do 
not here make the dependence of this quantity on N explicit for notational 
simplicity. Similarly we will denote a(0) [resp. p{9)} the probability of leaving 
(resp. staying in) state 9 for transition P. 

In the next two sections we study the Markov chain {0j,Zj} started 
at (0o, Zq) G x Z N and with transition probability P^ tact (which will 
be denoted P/v for simplicity, when no ambiguity is possible) as given in 
equation (2.9) that is a MH with target distribution jr N (dO,dZ) and pro- 
posal distribution Q(9,d-d)Q^ (<i3) and acceptance ratio given by equation 
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(2.10). In order to analyze the performance of the Markov chain gener- 
ated by -FV we will embed the exact marginal Markov chain with transition 
P as in equation (2.5) defined on (B N , ,8(0 N )) into a Markov chain de- 
fined on ((9 x Z N ) N ,(B(0)xB(Z)) N ) as follows. We_define a Markov chain 
which is generated by a MH transition probability Pn, with invariant dis- 
tribution Tr N (d9,dZ) and proposal distribution Q{9, d^rc^ (d^>) [instead of 
Q{6 ',dd)Qft (d3) for Pn], leading to the transition probability, 

P N {9, Z; dtf, d3) := a(0, $)Q{6, d&)*$ '(d3) 

(2.12) 



1- / a{6,$)Q{6,d-d) 



(-> 



with a(9,"&) := 1 Ar(#,$), where r(0,'&) is as in equation (2.6). Our analysis 
will rely upon a comparison of Pn and Pn, which live on a common space. 

Finally, we will hereafter use the following standard notation for prob- 
abilities and Markov chain transition probabilities. For a space (E,£) we 
define for f:E — > R: for /x a measure on (E,£) p(f) := f E f(x)fx(dx); ||/x|| := 
l su P|/l<i IM/)h for any A£ £ fi(A) = fi(I{x G A}), where I{- G A} = 
denotes the indicator function of set ^4; for a transition probability II : E x 
£ - [0, 1] n/(x) := f E U(x, dy)f{y) and IP/(x) := U(x, IP" 1 /) for i > 1 with 
n°/(x) :=/(a:). 

3. A simple convergence result for exact algorithms. The theory of 
-^-irreducible Markov chains has proved to be a very powerful tool in or- 
der to analyze classical MCMC algorithms, and in particular the MH algo- 
rithm. More precisely, since MCMC deal with the situation where ttP = tt, 
then if in addition P is ^-irreducible and aperiodic, it can be shown that 
\\P k (6o, •) — 7r(-)|| — ► as k — > oo 7r-a.s. [4]. This motivates the following the- 
orem. 

Theorem 1. Assume (Al) and that P defines a if) -irreducible and ape- 
riodic Markov chain such that ttP = tt. Then for any JVeN such that for 
any (9,Z) G x Z N , p(6,Z) > [with p{6,Z) as in equation (2.11)], P N is 
also if) -irreducible and aperiodic, and hence n N -a.s. [in {0 O , Z ) G G x Z n ], 

lim \\P n (9o,Zo;-)-tt n (-)\\=0. 

k— >+oo 

Proof. We here drop N for simplicity. First notice that by construction 
if P is ^-irreducible and aperiodic, then so is P [defined in equation (2.12)] 
and consequently P defines an ergodic Markov chain with invariant distri- 
bution tt . We will show that under the assumptions the accessible sets of P 
are included in those of P, which will allow us to conclude. More precisely 
we show by induction that for any k G N, (S,Z)e6x Z N and A x B G 
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B(@) x B(Z N ) such that P k (9, Z;AxB)>0, then P k (9, Z;AxB)>0. For 
any 9 £ 9 recall that := {Z : j N (9) > 0} and for notational simplicity we 
will use the convention lAj N ( , 9)/'-f N (9) = 1 whenever "y N (9) = below. First 
notice that from (2.10), for any (9, Z) £ Q x Z N and Ax B £ B{Q) x B(Z N ), 

P(9, Z; A x B) > Q$ (l A ^]M^O ^Bn Z^j a(9, #)Q(0, d#) 
+ 1{(9,Z) eA x B}p(9,Z) 

(3.1) 

+ I{(9,Z)eAxB}p(9,Z). 

Consequently, since for any 9 £ G and B £ B(Z N ) we have tt^ (5) = tt^ (B D 
Z$), we deduce that the implication is true for k = 1. Assume the induction 
assumption true up to some k = n > 1. Now for some (9,Z) £0x Z w let 
Axfi£ 5(9) x iB(Z iV ) be such that P™ +1 (6», Z;Ax5)>0 and assume that 

/ P n (9,Z;d$,d3)P(#,3;AxB) = , 
Jexz N 

which implies that P(i9, 3;AxB) = 0, P n ((9, Z; -)-a.s. and hence that P(i9, 3; 
AxB)=0, P n (9,Z; -)-a.s. from the induction assumption for fc = 1. From 
this and the induction assumption for k = n, we deduce that P(t?,3;^4 x 
B) = 0, P n (9,Z;-)-&.s. (by contradiction), which contradicts the fact that 
P n+1 (9,Z;Ax B)>0. □ 



4. Performance of the pseudo-marginal approach. For the purpose of 
our analysis we introduce the following subsets of 0. For any e > 0, N £ N 
and denoting for any random variable X with probability distribution /i, 
fi(X>e):=fi({X:X>e}), 

(4.1) T(e, N) := {9 £ 9 : Q$(\ N (0) > e) < e}, 

(4.2) 5(6, N) := {9 £ 9 : Q(0, T(e, iV)) > 1 - e}, 

(4.3) TZ(e,N):=S(e,N)DT(e,N), 

with X N (9) as defined below equation (2.1). For a set ic8we will denote 
.A its complement in 9. 

The main result of this section is Theorem 6 and relies, in addition to 
(Al), on the following mild assumptions. 

(A2) For any 9 £ 9, lim^ \\P k (9 , •) - vr(-) II = 0. 
(A3) For any 9 £ 9 and any e > 0, 
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lim Q$(\ N (e)>e)=0. 

N— >oo 

Assumption (A3) is fundamental to our analysis, and implies the following 
two lemmata. First, it is a sufficient condition to control the total variation 
distance between tt^ and Qq. 

Lemma 2. Assume (Al) and (A3). Then for any 9 G Q, N G N and 
6 6(0,1], 

||^(-) - Q?(.)|| < (3 + e)e + 2Q^(A» > e)I{9 G ^(e,iV)}. 
Proof. For any 9 G 0, from equation (2.3) 

11^(0 - <#(0II = Qe(h N (0) - i\i{\ N (e) > e » 

(4.4) 

+ Q^(| 7 >)-l|I{A A >)< e }). 

On the one hand 

(4.5) Q^ClW) " im N (0) < e}) < exp(6) - 1 < ee, 

and on the other hand 

Q^(| 7 » - 1|I{A» > e}) < Q» ( 7 N (9)I{\ N (9) > e}) 

+ Q$(\ N (0)>e). 

Notice that 

(1 - e)Q$(\ N (9) < e) < exp(-e)Q$ (\ N (9) < e) < Q$ ( 7 N (9)I{\ N (9) < e}). 

This, together with the fact that for any 9 G G, Qq{i N {9)) = 7^(1) = 1, 
leads to 

Q^( 1 N {9)l{\ N (9) > e}) = 1 - Q, Ar (7 7V (^)I{A iV (0) < e}) 
(4.7) <l-(l-e)(l-Q^(A»>e)) 

<e + Q$(\ N (9)>e). 

Now combining equations (4.4)-(4.7), we have for any 9 G G, 

(•) - (Oil < (1 + e)e + 2Q$(\ N (6) > e) 

(4.8) 

< (3 + e)e + 2Q$(\ N (6) > e)I{0 G T(e, N)}. □ 

Assumption (A3) also implies the following important intermediate re- 
sults. 



lim n(T(e 

N—>oc 


N)) 


= 1 


lim /x(5(e 

TV— ><x> 


AO) 


= 1 


lim n(lZ(e 

N^oo 


AO) 


= 1 
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Lemma 3. Lete>0 and S(e,N),T(e,N) andK(e,N) be as in (4-1)- 
(4-3). Assume (Al) and (A3). Then for any probability measure fx on (Q,B(@)), 

(4.9) 
(4.10) 
(4.11) 

Proof. For any e > and 9 G 6, limjv^oo I{6 &T(e,N)} = 1 from (A3). 
Equation (4.9) follows from the dominated convergence theorem. To prove 
equation (4.10), note that equation (4.9) implies that for any 9 G O, 

lim Q(9,I{tfeT(e,N)}) = 1. 

N^oo 

Consequently for any 9 G 0, lim.N^ 00 l{9 G 5(e, AO} = 1 and for any prob- 
ability measure fx, we have lhmv^oo fJ,(I{9 G S(e,N)}) = = 1. Equation 
(4.11) is immediate. □ 

As we shall see, our results heavily rely on an estimate of the distance 
between pv and P/v under (A3). 

Lemma 4. Assume (Al) and (A3). Let e G (0, 1] and (9, Z) G S(e,N) x 
Z such that X N (9) < e [S(e,N) being defined in (4-2)]. Then for 
PjV as defined in equation (2.12) and for any & x — > [—1,1], 

|P Jv ^,Z)-P iv ^,Z)|<24c. 

Proof. For simplicity we here drop A" in the notation for the transition 
probabilities. Let (9,Z) G 9 x Z N and ip : 6 x Z N [-1,1]. We have, by 
definition of P, 

Pi>{9,Z)-Pij}{9,Z) = S l + S2, 

with 

s x :=?i>(p,z)-Pi>{e,z\ 

S 2 :=P^(9,Z)-P4,(9,Z), 

where P is the MH transition probability with invariant distribution 7r(d9)Q^ (dZ) 
and proposal distribution Q(9,d-d)Q^ (d3), that is 

P(9, Z; d#, ft) := a(9, d)Q{9, d#)Q$ (<Q) 

(4.12) 



;(<&M3) 



1- / a(M)Q(M#) 
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From this and the definition of P in equation (2.12), the first term writes 
\Si\<\Q{6,{t$ -Q$)a{6J)il>)\ 

<2Q(0,\\n$(-)-Q$(-)\\). 
From this, Lemma 2 and since 9 G <S(e, N) 

\Si\< 2[(3 + e)e + 2Q(0, (A*(0) > e)I{0 G f (e, TV)})] 

<2(5 + e)e. 
The second term writes 



(4.13) 



(4.14) 



So 



exz N 
1/1(9, Z) 



1 A r(9, ■!?) — 1 A 



exz' 



7 "(0) 

„N 



r{9,$) 



1 Ar(9,$) — 1 A 



yv(£) 



<m) 



Q(9,d0)Q$(d3) 



and we therefore focus on the quantity 



on :- 



0xZ' 



exz^ 



1 Ar(9,$) — 1 A 



lAr(M)-lA 



<m) 



+ 



exz^ 



lAr(0,0)-lA 



7 "(0) 
7^(0) 



r(0,tf) 



Q(9,dti)Q${d3) 
I{\ N {$)>e}Q{9,d$)Q$(d3) 



r{9,$) 



I{\ N {ti)<e}Q(9,d#)Q$(d3). 



Noting that for any (x,y) G M 2 , 

|1 A exp(x) - 1 A exp(y)| = 1 A | exp(0 Ax) - exp(0 A y)\ < 1 A |x - y\, 
we deduce that for 9 G «S(e, iV), 

S <Q(9,Q$(I{\ N (O)>e})) 

+ Q(0, Q^(l A | log( 7 ») - log(7 JV W)|I{A JV (7?) < 6})), 
and consequently since J N (9) depends on 9 and Z only, 

|5 3 | < 2(1 A X N (9)) + 2Q(9, Q$ (I{\ N (tf) > e})) 

+ 2Q(0, Q#(l A \ N (0)I{\ N (0) < e})). 
Consequently since G S(e,N), 

\S 2 \ < 2(1 A X N (9)) + 2Q(0, (I{A JV (^) > e}I{tf G T(e, jV)})) 
+ 2Q(0, (1 A A*(0)I{A*(0) < e})I{tf G T(e, 2V)}) 
(4.16) + 2Q(9, Q$ (I{X N m > e}I{tf G T(e, iV)})) 

+ 2Q(9, Q$ (1 A A* (0)1^(0) < e})I{tf G T(e, AT)}) 
< 2(e + e + e + e) =8e. 



(4.15) 
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One concludes by combining (4.14) and (4.16). □ 

We now combine Lemmata 3 and 4 to prove the following proposition. 

Proposition 5. For any eG (0,1] and any probability measure fj> on 
(0, B(@)), there exists N(e, /j.) such that for any N > iV(e, /i) and any ip : x 
Z"->[-l,l], 

\^{{P N -P N )^{9,Z)))\<e. 

Proof. Let £G (0,1], /i be a probability distribution on (0,5(0)) and 
y?:0— > [—1,1]. We have, with 7Z(e,N) defined in equation (4.3), 

\ti(<p)\<\fi(<pI{0eK(e,N)})\ + (,(K(e,N)). 

By Lemma 3 there exists No(e,fi) G N (independent of (p) such that for 
N>N (e,»), 

(4.17) \»(it(e,N))\<e. 
For any ip:Q x Z N — > [—1,1], we have 

(4.18) fi(I(6 G TZ(e, N))^((P - P») = T l + T 2 , 
with 

Tx = G 1l(e, N))^(I(X N (9) <e)(P- P»), 
T 2 = /x(I(0 G ft(e, N))^(l(X N (9) > e)(P - P»). 
We apply Lemma 4 to Ti and conclude that 

(4.19) |Ti|<24e. 

We now turn to T 2 . First recall from equation (2.3) that for any 9 G 0, 

v^(A» < e) = Q$ (7* (9)I{X N (6) < e}), 

and, since Q$ (X N (9) > e) < e for 9 G T(e,N), we conclude that for any 
9eT(e,N) 

(1- £ ) 2 <(1-^(A»< £ ) 

< (^(exp(- e )]I{A» < s}) < 7T^(X N (9) < e). 

Hence, from the definition of 1Z(e,N), 

\T 2 \ = \f,(I{6 G TZ(e, N)}^(1{X N (9) > e}(P - Pty))| 

(4.20) < 2\fx(I{9 G K(e,N)})\(l - (1 - e) 2 ) 
<4e. 
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Now we choose e = e/30 and conclude with N(e,p) = Nq(e,p) and by com- 
bining (4.17) (for 2(p), (4.19) and (4.20). □ 

Our main result is as follows and provides us with a bound £ on the loss 
of efficiency of the approximating chain compared to the ideal chain, which 
can be made arbitrarily small for large iV's. Define for any 9 G O and any 
eG (0,1], k(e,9) :=mf{k:\\P k (9,-)-ir(-)\\ <e} and recall that for any 9 G 9, 
p{9) := 1 — J e o;(0 ,$)Q(9 ,cM) is the probability of not leaving 9 for P. 

Theorem 6. Assume (Al), (A2) and (A3). Let e,£ > and 9 G G. 
Then there exists N(e,£,9o) G N such that for any N > N(e,£,9o) and Zq G 
x Z N such that X N (9 ) < £e/(24k(e, 9 )) we have for any k > k(e, 9 ), 

\\P*(9 Q ,Z ;-)-7r N (-)\\<(l+£)e + p k (9 ). 

Corollary 7. Under the assumptions of Theorem 6, for any e,£ > 
and 9q G 0, there exists N(e,£,9o) G N such that for any N > N(e,£,8o) and 
Z £QxZ N such that X N (9 ) < £e/(24k(e,9 )) we have for any k > k(e,9 ) 
and any ip : O — ► [—1,1], 

±|P£(0o, Z ; ip) - 7r(<p)\ < (1 + £)e + p k (0 ). 

Proof. Dropping N for notational simplicity, we have that for any k > 
1, (9 ,Z ) G x Z N and any V:© x Z^^ [-1,1], 

(4.21) P fc y>(0 o , Zq) - tt(^) = S o (*0 + S^k) + 5 2 (fc), 
with [7fg (tp) := ng (ip(9, •)) hereafter for notational simplicity] 

S (k) = P k rP(9 , Z ) - P k (*e(mOo), 

s l {k)=p\^m%)-^m, 

S 2 (k) = P k ifj(9 ,Z )-P k ^8 ,Z ), 

where the magnitude of Si(k) can be controlled thanks to the properties 
of the transition probability P, So (A;) and S 2 (k) correspond to the bias 
introduced by the approximation to the "ideal" chain. As we shall see, for a 
fixed k this bias can be made arbitrarily small for N sufficiently large. Let 
e > and (9q,Zq) G x Z n such that \ N (9o) < e. By a coupling argument 
or induction |5b(fc)| < 2 p(9 ) k . Since vrf ty) :0 -»■ [-1, 1], by (A2) we have 

(4.22) k(e, 9q) < +oo and \Si(k(e, 8 ))\ < 2e. 

From now on we set k$ := k(e,8o) and use the following telescoping sum 
decomposition: 

k -i _ 

S 2 := S 2 (k ) = P l P ko ~ l H0o, Z ) - P l+1 P k °-« + ViP(9 , Z ) 

1=0 
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fco-1 _ 

= ]T P\P-P)P k ^ l+l H{6 ,Z ). 
1=0 

Let e G (0, 1]. Noticing for any I > 1 we have for any tp:Q x Z N — > [—1, 1] 
P^(0 o ,^o) = p(0o)^o,^o) 

+ E Pt-^QVj-i, •)))}(*(>, Z ) 

we apply Lemma 4 fco times and Proposition 5 (ko — 1) times (the result 
trivially applies to any finite measure) to show that there exists N(e,9o) 
such that for any N > N(e, 6> ) and some C < £/(24A; ) 

(4.23) | Si | < 2ACk e + (ko - l)e. 

We conclude by taking e = 2e(£ — 24koC)/(ko — 1) in equation (4.23) and 
combining with equation (4.22) in equation (4.21). □ 

5. Uniform and geometric ergodicity of exact algorithms. In this section 
we illustrate the critical importance of the choice of a good importance 
sampling distribution Q$ to ensure that P/v is uniformly ergodic. More 
precisely we show that, for a given N G N, if the importance weights J N (9) 
involved in the definition of the pseudo-marginal tt n (dQ) are unbounded for 
"too many" 0's, then Pjy cannot be geometrically ergodic. As we shall see 
"too many" will be quantified in terms of the measure of the set U N := 
{6 : VAf > 0, Qq(i n {6) > M) > 0} under vr. In addition, for a fixed N G N, 
using the fact that it is most often possible to prove the uniform ergodicity of 
the MH update P defined in (2.5) by establishing a minorization condition 
for the sub-stochastic kernel K(9, diD) := a(6, $)Q(6, d'ff) [that is there exists 
n o > 1) a constant e > and a probability measure v on (Q,B(Q)) such that 
for any (0,A) G x B(&), K n °(8,A) > eu(A)], we show that this property is 
systematically inherited by Pn whenever 7^ := supg ze0xZ N 7 W < +00. 

Theorem 8. Assume (Al) and let N G N. Then: 

1. if tt(U n ) > 0, then Pjv cannot be geometrically ergodic; 

2. if we assume that there exist no > 1, a constant e > anci a measure v on 
(9,5(6)) sucfc too* /or any 0, A G 9 x 5(9), K no (9,A) > eu(A) (which 
implies that P is uniformly ergodic) then if in addition 7^ < +00 then 
Pjv is uniformly ergodic. 

Remark 1. Concerning the second point of the theorem, it should be 
noted that it is not possible in general to achieve the rate of convergence 
of the marginal chain P, even when {7^} is bounded. Indeed, consider the 
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independent MH algorithm, in the discrete case for simplicity and densities 
with respect to the counting measure. It is possible to characterize exactly 
the second-largest eigenvalue of the transition probability. For P it takes the 
form 1 — (supg g Q "^[fy)" 1 ' while for Beaumont's form of the pseudo-marginal 

algorithm it will take the form 1 — (sup^^gQxz q(e'z) ^ n ^ ne particular 
case where ir(0,z) = tt(9)tt(z) and q(9,z) = q(8)q(z) this latter expression 
becomes 1 — (sup 0g @ ^fy) _1 ( su P2eZ ^jfy) -1 which in general will be larger 
than 1 — (supg g Q 1 - As we shall see this is not the case for the MCWM 

algorithm under appropriate assumptions (see Theorem 9). 

Proof of Theorem 8. We drop N in P/v and prove the first statement. 
We want to show that under the stated assumptions, for any e > 

Z N (I{a(e, Z) < e}) = Tr{Q$( 7 N (9)I{a(9, Z) < e})} > 0, 

where a(6,Z) is defined in equation (2.11). From [10], Theorem 5.1, this 
indeed implies that P cannot be geometric. For any (9, Z) £ x Z N with 
7 (0) > 0, by Fubini's theorem, Jensen's inequality and since for any i?g6, 
(7^(0)) = 1, we have 

a N (9,Z)= I (i A ?^r(6,&))Q(e,d#)Q$(d3) 
Jexz N \ T {?) J 

For any (6*,$) £U N x 0, since r(0,i?) < +oo [see equation (2.6)] and 

{Z: 1 N {9)>M}^0 

for any M > 0, we have 

r(0,0) 

hm sup 1 A N = 0. 

M^oo {z . lN(e)>M} 7 J V((9) 

Consequently, by the dominated convergence theorem, 

lim sup a(9, Z) = 0, 



M^oo 



{Z:j N (8)>M} 



hence for any e > there exists M < +oo such that 

Q$({Z: 7 N (9)>M,a(9,Z)<e})>0 

and hence 

Q^( 1 N (8)I{a(8, Z) < e}) > MQ^(I{ 1 N (9) > M, a(9, Z) < e}) > 0. 
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We deduce that 

n{Q^( 7 N (9)I{a(9,Z)<e})}>0. 

We now turn to the proof of the second claim. We first show by in- 
duction that for any k > 1 and A x B £ B(Q) x B(Z N ), P k (9,Z;A x B) > 
7 " fc f A K k (6,d$)n$(B). For k = l, 

P(9,Z;AxB)> [ a N {6,-&)Q{0,d-&)Qt{<Q) 

J AxB 

JaxbK t (?) J 

(5.1) 

>7* _1 I Q$(j N ma(9,V)Q(9,d&) 

J AxB 

= 7"* / K(9;d^(B). 

J A 

Assume the inequality is true for some k > 1. Then from the induction 
assumption and equation (5.1), 

P k+1 (9,Z;AxB) = I P k (9,Z;d$,d3)P($,3;Ax B) 
Jexz N 

> [ 7 : k K k (9,d$)n$(Z N h: 1 [ K(9;d$)n$(B) 
J0 J A 

> 7 ~ (fc+1) / K k+1 (9,d#)Tr$(B). 

J A 

Hence the result. From this result for k = n$ and the minorization assump- 
tion on K we deduce that for any (9, Z, A x B) £ G x Z N x (8(9) x ^(Z^)) 

P no (9,Z;A xB)> 7 - no / K n °(9,d$)^(B) 

J A 

>e. 1 ; n °v$AW$<,B)), 

and hence the second claim. □ 

6. Epilogue: MCWM. As pointed out earlier, the analysis of generaliza- 
tions of MCWM, defined in equation (2.7), is simpler than that of GIMH 
generalizations and relies on more classical arguments. This is due mainly 
to the fact that in this case Z ~ Qg and 3 ~ Q$ • However the existence of 
an invariant distribution for Prf 1 is not obvious in general (it is not a MH 
update). This is in contrast with i-'^ cact , for which the invariant distribution 
as well as its marginal distribution are known to be TT N (d9,dZ) and ir(dO), 
respectively. We give here a result which characterizes the invariant fc N (d9) 
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when it exists, and the rate of convergence of P^° lsy (denoted Pjy i n this 
section for simplicity), when P is uniformly ergodic, and a simple uniform 
weak law of large numbers holds for X N (9). 

(A4) There exist C G (0, +oo) and p G (0, 1) such that for any 9q G O and k G 

\\P k (Oo,-)-*(-)\\<C P k . 
(A5) We assume that for any e > 0, 

lim supQ${\ N (6)>e) = 0. 

N^oo ee0 

Theorem 9. Assume (Al), (A4), (A5) and that for any N > 1 there 
exists a probability distribution tt n on (0, £>(©)) such that tt n Pn = ft N , with 
PjV = i/y° lsy defined in equation (2.7). Then for any e G (0, p _1 — 1) there 
exists N(e, p) G N, p G (p, p(l + e)] C (/?, 1) and C G (0, +oo) such that for all 
N>N{e,p), 6> G9 andk>\, 

(6-1) \\P k (9 ,-)-n N (.)\\<Cp k , 
(6.2) |K-^||<C 



1-p 

Proof. In some instances we here drop N for notational simplicity. 
First notice that for any 9 G and n G N, 

n-l 

\\P n (e, •) - P n (0, Oil < E ll^^, (p - p)(p 1 - t)(0)II 

n-l 

(6.3) <Csup||P(e,0-mOllE^ 

eee i=0 

<-^— sup||P(0,O-mOII- 
1 - P eee 

We bound the last term on the right-hand side. We have 

P(6, d<&) - P(6, dd) = Q$ (a(9, 0) - 5(0, Z; 0, 3))Q(9, d0) 

+ 6 e (d$) [ Qj?®Q$(a(e,Z;0,3)-a(e,0)) 
Je 

x Q(0,d<&). 

Let e G (0, 1] and notice that since 

{Z, 3 G Z^} C {Z, 3 : (6) > e} U {Z, 3 : \ N (#) > e} 
U{Z,3:\ N (e)<e,\ N (#)<e}, 
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we have 

\Q? ® Q$(a(6, 0) - a(6, Z; 3))| < Q${\ N (0) > e) + Q^{\ N {$) > e) 

+ Q$(lA\ N (6)I(\ N (9)<e)) 
+ Q^(lAA 7V (i9)I(A JV (i9)<e)). 

Following the proof of Lemma 4, and from (A5), we conclude that there 
exists N(e) such that for N > N(e) and any 9,i>S0, 

\Qe He, 0) - 5(0, z- 0, 3))| < ie. 

Consequently for any e G (0, 1] there exists iV(e) G N such that for any > 
N(e) and 

||P(0,-)-m-)ll<4e. 
As a result and from (A4) for any (6, i?) G O and n G N, 

l|P»(0, .) _ p»(^ s .)|| < \\pn ie , •) - p n (e, on + ||P n (tf, - ^ n (tf, Oil 

+ 11^,0-^(61,011 
<c> n + e 



Define 



p:=pr/C(l + 4^)<pv^fl+ 1 ^ 



1 — p J V n 1 — p 

Choose e G (O,/}" 1 — 1) and let n G N be such that \/~C < y/1 + e and e 
(depending on n and p) be such that 

1 8ep~ n . 

n(l- p) 

This implies that p < p(l + e) < 1 for N > N(e), and hence that equation 
(6.1) follows. To prove equation (6.2) we notice that from equation (6.3), for 
any n > 1 and N > 2V(e/4) 

n— 1 n—1 ^-r 



||^pn_^pn|| < ^|| 7r (pn-i-l(p_p)(pi_ 7r )(.))|| <Ce£y <- 
i=0 i=0 

and since 1 1 vr — tt \\ = bim^oo ||-7rP n — tv P n ||. We conclude the proof by 
taking N(e,p) =N(e) VJV(e/4). □ 
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7. Examples: Reversible jumps. In this section we illustrate the poten- 
tial of the pseudo-likelihood approach to Monte Carlo computations devel- 
oped in this paper in the context of reversible jump MCMC [3] algorithms 
(RJMCMC hereafter), which are well known for their difficult implementa- 
tion. We start with a toy example, for which the true marginals are known 
exactly, hence providing a simple ground truth, and illustrate the interest 
of the approach in a scenario, which in our opinion reflects the difficulties 
encountered in practice when implementing RJMCMC in more realistic and 
difficult scenarios. We then move on to a more substantial example related 
to variable selection for generalized linear models. We first show how an ap- 
parently reasonable RJMCMC applied to a seemingly simple nested models 
selection problem can easily fail to produce reliable results and demonstrate 
how our methodology can easily circumvent this problem and render the 
algorithm much more reliable. 



7.1. Toy example. We consider here a toy transdimensional target dis- 
tribution defined on{l}x]Ru{2}x]R 2 , 



+ 





X 




"0" 




1 -0.9" 




('- 


.y. 







,E = 


-0.9 1 


) 



In a Bayesian setup this would correspond to an inference problem for which 
two models M\ and M2 are considered to explain the data, the models being 
indexed by 0. Obviously here tt(9 = 1) = 1/4 and tt(9 = 2) = 3/4. However 
in order to illustrate our methodology we develop here a reversible jump 
algorithm [3] to sample from this distribution, and compare our results with 
the exact distribution. A simple marginal ideal chain can be defined through 
the transition 



(7.1) P(6,tf) = lA^-I{$^e)+I($ 

7T(0) 



1- 1 A 



n(9) 



■!{■&' + 



that is in other words when in model M.\ (resp. M.2) we propose a jump to 
model M2 (resp. .Mi) with probability 1. This chain is obviously uniformly 
ergodic, which is often the case for finite discrete chains in practice. Now 
assume that we are given the algebraic expression for ir(6,z) up to a factor 
of 1/4 and that, possibly with age, we fail to recognize 3 times a bivari- 
ate normal distribution for model M.2- For simplicity, we will assume that 
we successfully recognize a univariate normal distribution for model M\. A 
standard approach in such situations consists of resorting to a RJMCMC 
algorithm that uses the (available to a constant) density tt(0, z) for model 
M2 and requires one to propose a bidimensional vector z ~ Q(-) when at- 
tempting a move from model M\ to M.2- Naturally the effectiveness of the 
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algorithm will, as we shall see, highly depend on this proposal distribution 
whose choice might be far from obvious in more complex scenarios. In order 
to improve this basic RJMCMC algorithm, we investigate here a very sim- 
ple strategy which relies on the pseudo-likelihood framework described ear- 
lier more sophisticated and efficient approaches are possible and currently 
being explored in other work. For any rj > 0, let N v (z\ /i, E^) denote the 
truncated normal distribution such that M v {z; //,£d) cx M{z; fi, S^) when- 
ever {z — //) T S _1 (2; — n) <n 2 and Af v (z; S^) = otherwise. For 9 = 2 we 

define (Z) = Q(z(l)) Y\tM<^ + ^V, logvr e (z(i - l)),a 2 / 2 ) 

for some a 2 <l, that is we use a form of discretization of the Langevin diffu- 
sion with drift ^V 2 logTTg(z), and run a Markov chain with transition density 

U(z,$) =N v ($;z + ?>-V z logTTg(z), a 2 ) whose equilibrium distribution is an 
approximation of irg(z). The algorithm proceeds as the marginal algorithm 
described earlier in equation (7.1), except that for 6 = 2, ir(9) is replaced 
with the estimator 



n N (6) = - 
V ; N 



(7.2) 



TT(Ml)) 

Q(*(i)) 

,yV -*{6,z{i)) 



i=2 



M v (z(i); z(i - 1) + a 2 /2V z log7r 9 (z(i - 1)), a 2 



where z(l), z(2), . . . , z(N) are sampled according to above. Note that the 
case N = 1 corresponds to the "standard" RJMCMC algorithm described 
above. For the purpose of illustration we took Q(z) =M(z; [3,3] T ,/2)> which 
while being an obviously bad choice ensures irreducibility, and hence (in 
theory) convergence for the standard RJMCMC algorithm. We ran our al- 
gorithm for N = 1,5 and 10 for 450,000, 90,000 and 45,000 iterations, re- 
spectively, resulting in comparable computational efforts. The respective 
empirical expected acceptance probabilities were 0.0121, 0.5206 and 0.5056 
(note that the theoretical expected acceptance probability in the stationary 
regime for the marginal algorithm is 1/4 + 3/4 x 1/3 = 1/2). In Figure 1 we 
present the "instantaneous" model probability estimators ^/kJ2i=i^(@i = 1) 
and l/kY^i=i^-(Gi = 2) as a function of the iterations k. Note the deceptive 
behavior observed for JV = 1 which suggests that convergence has occurred. 



7.2. Application to variable selection in GLMs. In this section we present 
an application of the pseudo-marginal principle to model selection in gen- 
eralized linear models, and focus here more particularly on the logit link. 
More precisely we assume that we observe M > 1 realizations (yi,Xi) for 
i G {1, . . . , M} of a random variable pair (Y, X) taking values in {0, 1} x R fc 
for some k > 1 and that the dependence between Y and X is characterized 
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FlG. 1. Instantaneous estimation of the model probabilities as a function of the iterations 
for N — 1, 5, 10 from top to bottom. 



by the conditional distribution P(Y = 1\X, z 

F(Y = l\X,z) 



log 



l-F(Y = l\X,z) 



assumed to satisfy 
= Xz, 



for a column vector zGl*. Not all components X(l), I = 1, . . . , k, of X (the 
"covariates" ) might be relevant to sparsely explain the data and as a result 
we might seek to compare models for which some of the components of z are 
set to zero this is what we refer to as the model selection problem. In order to 
carry out inference in a Bayesian setup it is convenient to introduce indicator 
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variables 9(1) £ {0, 1} for I = 1, . . . , k such that covariate X(l) is excluded 
whenever 9(1) = 0. This allows us to index the 2 k — 1 models (we exclude 
the model with no dependence) with the vector 9 := (9(1), 9(2), . . . , 9(h)). 
Let C be the M x k matrix whose ith row is Xi, we then denote Cq the 
submatrix of C that contains the columns C for which 9(1) = 1 and likewise 
for zq the subvector of z. It will be convenient to denote 9 = Z^f=i^(0 the 
number of active covariates in model 9. We ascribe prior distributions to 
k and zq : Pr(9 = k) oc X k /k\ for some fixed A > and following [6] we 
set zq ~7V r (0, [CgCe/^M] -1 ) a priori. Denoting Pi(z$) :=F(y« = l\xi,zg) the 
joint posterior distribution is 

N 

tt(9,z ) oc Y[ Pi (z e ) e W(l -p i (z e )) 1 - m N(ze;0, [C^C e /4M)- l )X s /9l 

i=l 

Variable selection in a Bayesian context typically relies on the marginal 
posterior model probabilities tt(9), which are in the present situation in- 
tractable. One can for example resort to MCMC and this is the route fol- 
lowed up here. 

The basis of our algorithm is a reversible jump MCMC algorithm for the 
marginal model which consists of a birth/death update. We first describe 
a marginal algorithm with transition P(9, dd) which of course cannot be 
implemented. The pseudo-marginal algorithm will be a simple variation of 
this algorithm. Given a model 9, P(9, dd) can be described algorithmically 
as follows. With probability 1/2, either: 

• set 9+ = 9 and if 9 < k, 

1. choose uniformly among the k — 9 nonactive components' indexes, say 
j, and set 9 + (j) = 1, 

2. set $ = 9 + with probability 

(7.3) 1A .^)l/(»+D 



tt(9) i/(k-ey 

otherwise •& = 9, 

or 

• set 6~ = 9 and if 9 > 0, 

1. choose uniformly among the 9 active components' indexes, say j, and 
set 9~(j)=0, 

2. set -d = 9~ with probability 

(7.4) 1A .,<»-) l/(*-* + D 



?r(0) 1/9 
otherwise $ = 9. 
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This algorithm cannot be implemented, but it is nevertheless possible to 
implement a reversible jump algorithm on the joint distribution 7t(9,zq). A 
solution suggested in [2, 6], which we will refer to as the standard RJ algo- 
rithm here, can be understood as being a simple variation on the algorithm 
above, where in the birth move the additional sampling of a new coefficient 
from a distribution Q is required, resulting in the acceptance ratio 

7T(g+, Z +) 1 l/g+l) 

7T(e,z e ) Q(z+(j))i/(k-ey 

In [6] it is suggested to use as a proposal distribution for z + (j) the marginal 
of the normal distribution with mean the maximum likelihood estimator 
z mi °f the saturated model and covariance the corresponding Hessian. Our 
pseudo-likelihood algorithm is very similar to the algorithm developed for 
the toy example in the previous section, that is it relies on a discretized 
Langevin diffusion, and consists formally of simply replacing ir(0 + ) and 
7r(#~) in the pseudo-code above with an estimator of the form equation 
(7.2) for 6 € {0 + ,6~}. The following setup was considered. We generated 
artificial data from the logit model as follows. We chose M = 50, k = 4, a set 
of coefficients z* = [1 0.5 —2 0.01] T and generated covariates as follows: with 
Zi ~ M(0,I M ) we set d = Z t for i = 1,3,4 and C 2 = 0.9 x Z\ + 0.1 x Z 2 . 
This resulted into two correlated covariates, number 1 and 2. The maximum 
likelihood estimate for z* was found to be z^ x = [5.22445 - 3.71672 - 
2.4011 — 0.587472] 1 , suggesting (a) the presence of a main mode for the sat- 
urated model around this value, significantly different from the truth and 
(b) a mismatch between the modes and marginal modes of 7r(llll, z\\\\) 
with those of vr(1011,zion) (z^ = [1.73253 -2.30933 - 0. 648927] T ) and 
7r(0111,z iii) (zohi = [1-5968 -1.98855 -0.0922961] 1 ) which might result in 
poor mixing of the standard "birth-death" RJMCMC algorithm described 
above, a behavior likely to be reinforced here by the choice of the proposal 
distribution. This is confirmed by our simulation. In Figure 2 we present the 
estimated model probabilities (indexed by the decimal representation of 9) 
for N = 1, 5, 50, 100, 200. Note that for the case JV = 1 the birth/death move 
was complemented by 10 iterations of a within model one variable at a time 
random walk MH for each sweep. We observe the large discrepancy between 
the results for N =1,5 and the results for N = 50, 100, 200 the latter being 
in agreement. Note that this is despite the apparent convergence of estima- 
tors of the posterior inclusion probabilities ¥(6(j) = 1) for j = 1, . . . , k for 
the case N = 1 (Figure 3) and that the results obtained after 20,000, 10,000 
and 5,000 for N = 50,100 and 200 are much more reliable than for N = 1 
after 1,000,000 iterations. The respective expected acceptance probabilities 
are given in the table below. Note that using the estimated model probabili- 
ties obtained for N = 200 one finds an acceptance rate of 0.29592. Finally, in 
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1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 
0.35 I 1 1 1 1 1 1 1 1 . 1 . 1 1 . .- 




1 2 3 4 5 6 7 6 9 10 11 12 13 14 15 




1 2 3 4 5 6 7 6 9 10 11 12 13 14 15 




1 2 3 4 5 6 7 S 9 10 11 12 13 14 15 



Fig. 2. Model probabilities for N — 1,5, 50, 100, 200 (from top to bottom). 
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Fig. 3. Instantaneous estimation of the model probabilities as a function of the iterations 
for N= 1,5,50,100,200. 
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Fig. 4. Snapshots of the transitions for ze between models for N = 100, together with 
their associated importance weights (the horizontal lines correspond to the true values of 
z ). 



THE PSEUDO-MARGINAL APPROACH 



29 



Figure 4 we present the trace of the z's drawn with our discretized Langevin 
while attempting to jump between models, which together with the evolu- 
tion of the values of the importance weights illustrates why our approach 
might be of interest. 

N/nb iter, x 1000 1/1000 5/400 50/40 100/50 200/50 
accept prob. 0.064293 0.16569 0.25885 0.28433 0.29371 

8. Conclusion. The pseudo-marginal approach to stochastic simulation 
is a highly versatile methodology which has diverse potential applications 
in a variety of areas. The focus of this paper has been on some of the the- 
oretical underpinnings of the method. Our main results describe ergodicity 
and uniform ergodicity of GIHM and its exact generalizations suggested in 
this paper, and we also give a comparison with an inexact variants, akin to 
MCWM. Empirical evidence in [1] and in the present paper in the context 
of reversible jumps for model selection in generalized linear models suggests 
that the methodology has considerable promise. Currently ongoing work 
confirms this. 
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